Add fallback pruning for stalled finalization #179
Conversation
When finalization stalls (e.g., network issues, low participation), the States and Block tables grow unbounded because retention-based pruning only triggers on finalization advancement. This adds a periodic check every 7200 slots (~8 hours) that calls prune_old_states and prune_old_blocks directly when finalization is far behind, adapted from Zeam's approach. The check is cheap (two integer comparisons per slot-0 tick) and the actual pruning reuses existing idempotent methods.
🤖 Kimi Code Review

Review Summary

The PR introduces a fallback pruning mechanism for when finalization is stalled, which is a good defensive measure against unbounded storage growth. However, there are several issues that need attention:

Critical Issues

Security & Correctness Issues

Performance & Best Practices
Suggested Fixes

```rust
// In blockchain/src/store.rs, fix the race condition:
let current_finalized_slot = store.latest_finalized().slot;
if slot.is_multiple_of(PRUNING_FALLBACK_INTERVAL_SLOTS)
    && slot.saturating_sub(current_finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
{
    // ... rest of code
}
```
```rust
// In storage/src/store.rs, fix atomic reads:
pub fn periodic_prune(&mut self) {
    let (finalized_checkpoint, justified_checkpoint) = {
        let finalized = self.latest_finalized();
        let justified = self.latest_justified();
        (finalized, justified)
    };
    let protected_roots = [finalized_checkpoint.root, justified_checkpoint.root];
    // ... rest of code
}
```

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review

Could not run tests in this environment due to rustup permission errors.

Automated review by OpenAI Codex · custom prompt
Greptile Summary

This PR addresses a gap in PR #170's retention-based pruning: because pruning only triggers when finalization advances, the tables grow unbounded when finalization stalls.

Key points:
Confidence Score: 4/5
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["on_tick — interval 0"] --> B{slot > 0?}
    B -- No --> G[Skip pruning]
    B -- Yes --> C{slot % PRUNING_FALLBACK_INTERVAL_SLOTS == 0?}
    C -- No --> G
    C -- Yes --> D{"slot - finalized_slot\n> PRUNING_FALLBACK_INTERVAL_SLOTS?"}
    D -- No\nFinalization healthy --> G
    D -- Yes\nFinalization stalled --> E["warn! log stall event"]
    E --> F["store.periodic_prune()"]
    F --> F1["protected = [finalized.root, justified.root]"]
    F1 --> F2["prune_old_states(protected)\nprune_old_blocks(protected)"]
    F2 --> F3{anything pruned?}
    F3 -- Yes --> F4["info! log pruned counts"]
    F3 -- No --> G
```

Last reviewed commit: d6c7958
🤖 Claude Code Review

Here is my review of PR #179:

Review: Add periodic fallback pruning for stalled finalization

The motivation is clear and the implementation is straightforward. A few points to consider:

Notable:
…s 3 redundant DB reads for the same checkpoint value that doesn't change between calls in the interval-0 handler.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Two tests that verify the 3-part guard in on_tick interval 0:
- on_tick_triggers_periodic_pruning_when_finalization_stalled: pruning
fires when slot is a multiple of the interval AND finalization lag
exceeds the threshold
- on_tick_skips_periodic_pruning_when_finalization_healthy: pruning does
NOT fire when finalization is recent (lag within threshold)
Also exports Table, StorageWriteBatch, and StorageReadView from the
storage crate to enable direct backend access in blockchain tests.
in crates/storage/src/store.rs (constants and test sections coexist, periodic pruning tests updated to use test_store_with_backend helper)
periodic_prune now takes a slot parameter and handles the 3-part trigger condition (slot > 0, multiple of interval, finalization lag) internally. The blockchain on_tick handler just calls store.periodic_prune(slot).
Check cheap arithmetic conditions (slot == 0, is_multiple_of) before reading latest_finalized from the database. This avoids a wasted DB read on 7199 out of 7200 calls.
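A minimal sketch of that guard ordering, assuming a hypothetical `should_prune` helper (using `%` in place of `is_multiple_of`, and a closure standing in for the store's `latest_finalized` DB read):

```rust
const PRUNING_FALLBACK_INTERVAL_SLOTS: u64 = 7200;

// Hypothetical predicate: `read_latest_finalized_slot` stands in for the
// store's database read of the latest finalized checkpoint slot.
fn should_prune(slot: u64, read_latest_finalized_slot: impl Fn() -> u64) -> bool {
    // Cheap integer checks first: 7199 out of 7200 calls return here
    // without touching the database.
    if slot == 0 || slot % PRUNING_FALLBACK_INTERVAL_SLOTS != 0 {
        return false;
    }
    // Only now pay for the DB read.
    let finalized_slot = read_latest_finalized_slot();
    slot.saturating_sub(finalized_slot) > PRUNING_FALLBACK_INTERVAL_SLOTS
}

fn main() {
    // Interval boundary with a large finalization lag: pruning fires.
    assert!(should_prune(14_400, || 0));
    // Healthy finalization (lag within threshold): pruning skipped.
    assert!(!should_prune(14_400, || 10_000));
    // Off-interval slot: skipped without ever reading the DB.
    assert!(!should_prune(7_201, || 0));
}
```

The ordering matters only because short-circuit evaluation stops at the first failed arithmetic check; the DB read is reached exclusively on interval boundaries.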
Fallback pruning now lives entirely in the storage module's update_checkpoints method as an else branch: when finalization doesn't advance, it checks the head slot and prunes if finalization is stalled. This keeps the consensus layer completely unaware of pruning logic. Removes periodic_prune as a standalone method, the on_tick call, and the blockchain-level pruning tests (now covered by storage tests).
…7200 slots

Remove the interval-based guard and the PRUNING_FALLBACK_INTERVAL_SLOTS constant. The else branch in update_checkpoints now unconditionally calls prune_old_states and prune_old_blocks when finalization doesn't advance. The prune methods are already no-ops when within retention limits, so this is safe and more testable.
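A hedged sketch of this final shape: the fallback lives entirely inside the storage layer's `update_checkpoints`. The in-memory `Vec` table, the `head_slot`-based cutoff, and all names here are simplifications, not the crate's actual API.

```rust
const STATES_TO_KEEP: u64 = 900;

// Toy stand-in for the real Store; `state_slots` plays the States table.
struct Store {
    finalized_slot: u64,
    state_slots: Vec<u64>,
}

impl Store {
    // Already a no-op when every stored state is within retention,
    // so calling it on every head update is cheap in the common case.
    fn prune_old_states(&mut self, head_slot: u64) {
        let cutoff = head_slot.saturating_sub(STATES_TO_KEEP);
        self.state_slots.retain(|&s| s >= cutoff);
    }

    fn update_checkpoints(&mut self, head_slot: u64, new_finalized_slot: u64) {
        if new_finalized_slot > self.finalized_slot {
            self.finalized_slot = new_finalized_slot;
            self.prune_old_states(head_slot); // finalization-gated path (PR #170)
        } else {
            // Fallback: finalization did not advance. Pruning unconditionally
            // is safe because prune_old_states short-circuits when nothing
            // is older than the retention window.
            self.prune_old_states(head_slot);
        }
    }
}

fn main() {
    let mut store = Store { finalized_slot: 0, state_slots: (0..=2000).collect() };
    // Finalization stalled at slot 0 while the head reached slot 2000:
    // the else branch still trims states older than head - STATES_TO_KEEP.
    store.update_checkpoints(2000, 0);
    assert_eq!(store.state_slots.first().copied(), Some(1100));
}
```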
Motivation

PR #170 adds retention-based pruning for blocks and states, but it only triggers when finalization advances (`finalized.slot > old_finalized_slot` in `update_checkpoints`). If finalization stalls (due to network issues, low participation, or other consensus failures), the `States` and `BlockHeaders`/`BlockBodies`/`BlockSignatures` tables grow unbounded because the pruning condition is never met.
Description

Adds an `else` branch to `update_checkpoints` in the storage module that calls `prune_old_states` and `prune_old_blocks` on every head update when finalization doesn't advance. The prune methods are already no-ops when within retention limits (`STATES_TO_KEEP` = 900, `BLOCKS_TO_KEEP` = 1800), so this adds negligible overhead in the common case.

Adapted from Zeam's approach of running fallback pruning when finalization is behind.
Key design decisions

- The consensus layer (`on_tick`, `on_block`) is unaware of pruning logic
- The prune methods short-circuit when there's nothing to prune
- Protected roots are `[latest_finalized.root, latest_justified.root]`, matching the existing finalization-gated path
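The protected-roots rule above can be sketched as follows. This is an illustrative toy, not the crate's actual API: a `HashMap` of root → slot stands in for the block tables, and both checkpoint roots are excluded from deletion even when they fall outside the retention cutoff.

```rust
use std::collections::HashMap;

type Root = [u8; 32];

// Remove blocks older than `cutoff_slot`, keeping protected roots;
// returns how many entries were pruned.
fn prune_old_blocks(blocks: &mut HashMap<Root, u64>, cutoff_slot: u64, protected: &[Root]) -> usize {
    let before = blocks.len();
    blocks.retain(|root, slot| *slot >= cutoff_slot || protected.contains(root));
    before - blocks.len()
}

fn main() {
    let finalized_root = [1u8; 32];
    let justified_root = [2u8; 32];

    let mut blocks = HashMap::new();
    blocks.insert(finalized_root, 10); // old, but protected
    blocks.insert(justified_root, 20); // old, but protected
    blocks.insert([3u8; 32], 30);      // old and unprotected: pruned
    blocks.insert([4u8; 32], 500);     // within retention: kept

    let pruned = prune_old_blocks(&mut blocks, 100, &[finalized_root, justified_root]);
    assert_eq!(pruned, 1);
    assert!(blocks.contains_key(&finalized_root) && blocks.contains_key(&justified_root));
}
```

Protecting both roots keeps the fallback path consistent with the finalization-gated path, so a stalled chain can never prune away the checkpoints it would need to recover.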
Changes

- `crates/storage/src/store.rs`: add `else` branch in `update_checkpoints`, add 2 tests
- `crates/storage/src/lib.rs`: export `StorageReadView`, `StorageWriteBatch`, `Table`

How to test